AITopics | steepest descent


steepest descent


The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient

Li, Jichu, Tang, Xuan, Zou, Difan

arXiv.org Machine Learning, Jun-18-2026

A variety of widely used optimization methods like SignSGD and Muon can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-$p$ norms. We show that, without momentum, worst-case convergence and successful classification can only be guaranteed with full-batch gradients. In contrast, momentum enables small-batch convergence to an approximate max-margin solution through a batch-momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we investigate batch-size-one steepest descent without momentum and show, via a concrete data example, that it converges to a fundamentally different bias, exposing a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full-batch behavior, and paves the way for deeper explorations of the training behavior of stochastic steepest descent algorithms.
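
To make the "steepest descent under different norms" view in the abstract above concrete, here is a minimal NumPy sketch. It computes the unit-norm update direction for a matrix gradient under three norms: the entry-wise max-norm (a sign update, as in SignSGD), the Frobenius norm (a normalized gradient), and the spectral norm (an orthogonalized gradient, the idealized update that Muon approximates). The function names and the particular momentum form `m = beta * m + (1 - beta) * g` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def steepest_direction(grad, norm):
    """Unit-norm matrix D that maximizes <grad, D> under the chosen norm."""
    if norm == "max":
        # Entry-wise l_inf ball: the maximizer is the sign pattern of the gradient.
        return np.sign(grad)
    if norm == "frobenius":
        # Euclidean ball: the maximizer is the normalized gradient.
        return grad / np.linalg.norm(grad)
    if norm == "spectral":
        # Spectral-norm ball: the maximizer is U V^T from the reduced SVD.
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown norm: {norm}")

def momentum_step(w, m, minibatch_grad, lr, beta, norm):
    """One hypothetical momentum steepest descent step on a mini-batch gradient."""
    m = beta * m + (1 - beta) * minibatch_grad   # exponential moving average
    w = w - lr * steepest_direction(m, norm)     # move along the steepest direction
    return w, m
```

In each case the inner product between the gradient and the returned direction equals the dual norm of the gradient: the sum of absolute entries for the max-norm, the Frobenius norm for itself, and the nuclear norm for the spectral norm.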

artificial intelligence, implicit bias, machine learning, (16 more...)

arXiv.org Machine Learning

2602.11557

Country: Asia (0.46)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)


Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data

Neural Information Processing Systems, Jun-11-2026, 16:05:08 GMT

Different gradient-based methods for optimizing overparameterized models can all achieve zero training error yet converge to distinctly different solutions inducing different generalization properties. We provide the first complete characterization of implicit optimization bias for p-norm normalized steepest descent (NSD) and momentum steepest descent (NMD) algorithms in multi-class linear classification with cross-entropy loss. Our key theoretical contribution is proving that these algorithms converge to solutions maximizing the margin with respect to the classifier matrix's p-norm, with established convergence rates. These results encompass important special cases including Spectral Descent and Muon, which we show converge to max-margin solutions with respect to the spectral norm. A key insight of our contribution is that the analysis of general entry-wise and Schatten p-norms can be reduced to the analysis of NSD/NMD with max-norm by exploiting a natural ordering property between all p-norms relative to the max-norm and its dual sum-norm. For the specific case of descent with respect to the max-norm, we further extend our analysis to include preconditioning, showing that Adam converges to the matrix's max-norm solution. Our results demonstrate that the multi-class linear setting, which is inherently richer than the binary counterpart, provides the most transparent framework for studying implicit biases of matrix-parameter optimization algorithms.
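
For readers who want the objects in this abstract written out, the block below gives one common way to state normalized steepest descent and the max-margin quantity it is said to converge to. The notation is an assumption for illustration and may differ from the paper's: $W \in \mathbb{R}^{k \times d}$ is the classifier matrix, $(x_i, y_i)$ are the training examples, $\mathcal{L}$ is the cross-entropy loss, $\eta_t$ is the step size, $e_c$ is the $c$-th standard basis vector, and $\|\cdot\|$ is the chosen entry-wise or Schatten $p$-norm.

```latex
% Normalized steepest descent (NSD) with respect to the norm \|\cdot\|
W_{t+1} = W_t - \eta_t \Delta_t,
\qquad
\Delta_t \in \arg\max_{\|\Delta\| \le 1} \langle \nabla \mathcal{L}(W_t), \Delta \rangle

% Max-margin value with respect to the same norm
\gamma = \max_{\|W\| \le 1} \; \min_{i} \; \min_{c \ne y_i} \; (e_{y_i} - e_c)^\top W x_i
```

Read this way, the implicit-bias claim is that the smallest margin of the normalized iterate $W_t / \|W_t\|$ approaches $\gamma$ as training proceeds. With the spectral norm, $\Delta_t$ is the orthogonalized gradient, which is the Spectral Descent and Muon case mentioned in the abstract.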

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.59)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.59)
Information Technology > Artificial Intelligence > Machine Learning (0.39)


The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Gronich, Eitan, Vardi, Gal

arXiv.org Machine Learning, Mar-4-2026

We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
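
As a toy illustration of the Signum case in the abstract above, the sketch below runs momentum sign descent with a decaying learning rate on a linear classifier (a smooth, 1-homogeneous model) and reports the $\ell_\infty$-normalized margin. The data, hyperparameters, the $1/\sqrt{t}$ schedule, and the function names are all assumptions chosen for illustration; this is not the experimental setup of the paper.

```python
import numpy as np

# Hypothetical, linearly separable toy data with labels in {+1, -1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def signum(X, y, steps=200, lr0=0.5, beta=0.9):
    """Momentum sign descent on the mean logistic loss of a linear model."""
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    for t in range(steps):
        z = y * (X @ w)                                   # per-example margins
        # Gradient of mean(log(1 + exp(-z))) with respect to w.
        grad = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)
        m = beta * m + (1 - beta) * grad                  # momentum buffer
        w = w - lr0 / np.sqrt(t + 1) * np.sign(m)         # decaying step size
    return w

def linf_margin(w, X, y):
    """Smallest margin after normalizing w to unit l_inf norm."""
    return np.min(y * (X @ w)) / np.max(np.abs(w))

w = signum(X, y)
print(linf_margin(w, X, y))
```

On this particular data every gradient coordinate stays negative, so the iterate moves along the direction (1, 1) and the normalized margin equals 3 up to rounding, which is also the best $\ell_\infty$ margin achievable here. The example is only meant to show what "maximizing the $\ell_\infty$ margin" refers to in the simplest setting.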

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Machine Learning

2602.1634

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)


An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants

Crawshaw, Michael, Modi, Chirag, Liu, Mingrui, Gower, Robert M.

arXiv.org Machine Learning, Oct-14-2025

To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and the recently proposed Muon as a type of non-Euclidean gradient descent, and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model based momentum (known as Momo). The new Momo variants of Muon are significantly more robust to hyperparameter tuning, and often achieve a better validation score. Thus for new tasks, where the optimal hyperparameters are not known, we advocate for using Momo in combination with MuonMax to save on costly hyperparameter tuning.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2510.09827

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)


The Method of Infinite Descent

Batley, Reza T., Saha, Sourav

arXiv.org Artificial Intelligence, Oct-8-2025

Training - the optimisation of complex models - is traditionally performed through small, local, iterative updates [D. E. Rumelhart, G. E. Hinton, R. J. Williams, Nature 323, 533-536 (1986)]. Approximating solutions through truncated gradients is a paradigm dating back to Cauchy [A.-L. Cauchy, Comptes Rendus Mathématique 25, 536-538 (1847)] and Newton [I. Newton, The Method of Fluxions and Infinite Series (Henry Woodfall, London, 1736)]. This work introduces the Method of Infinite Descent, a semi-analytic optimisation paradigm that reformulates training as the direct solution to the first-order optimality condition. By analytical resummation of its Taylor expansion, this method yields an exact, algebraic equation for the update step. Realisation of the infinite Taylor tower's cascading resummation is formally derived, and an exploitative algorithm for the direct solve step is proposed. This principle is demonstrated with the herein-introduced AION (Analytic, Infinitely-Optimisable Network) architecture. AION is a model designed expressly to satisfy the algebraic closure required by Infinite Descent. In a simple test problem, AION reaches the optimum in a single descent step. Together, this optimiser-model pair exemplifies how analytic structure enables exact, non-iterative convergence. Infinite Descent extends beyond this example, applying to any appropriately closed architecture. This suggests a new class of semi-analytically optimisable models: the Infinity Class; sufficient conditions for class membership are discussed. This offers a pathway toward non-iterative learning.

artificial intelligence, infinite descent, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2510.05489

Country: North America > United States > Virginia (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.31)


All reviewers AC

Neural Information Processing Systems, Aug-17-2025, 06:57:45 GMT

(Lemma 1). (Theorem 1). Reviewer 1: Thank you for your valuable comments. Table 1 was a surprising empirical observation. On dividing out the gradient scale -- this approach (taken by Adam) requires more learning rate tuning than Fromage.

breakdown, constant factor, steepest descent, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.54)


Old Optimizer, New Norm: An Anthology

Bernstein, Jeremy, Newhouse, Laker

arXiv.org Artificial Intelligence, Dec-6-2024

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
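
The claim that Adam reduces to steepest descent once its exponential moving averages are switched off can be checked directly for the simplest case. The sketch below is a stripped-down Adam update written for illustration (bias correction is omitted, and the stability constant `eps` is set to zero by assumption; the function name is hypothetical). With both decay rates set to zero, the update is the sign of the gradient, which is the steepest descent direction under the max-norm.

```python
import numpy as np

def adam_update(g, m, v, beta1, beta2, eps=0.0):
    """Simplified Adam direction (no bias correction), returned with new state."""
    m = beta1 * m + (1 - beta1) * g          # first-moment moving average
    v = beta2 * v + (1 - beta2) * g**2       # second-moment moving average
    return m / (np.sqrt(v) + eps), m, v

g = np.array([0.3, -2.0, 1e-3])              # hypothetical gradient, no zero entries
zeros = np.zeros_like(g)

# Switching off both moving averages gives m = g and v = g**2,
# so the direction is g / |g|, i.e. the entry-wise sign of the gradient.
direction, _, _ = adam_update(g, zeros, zeros, beta1=0.0, beta2=0.0)
print(np.allclose(direction, np.sign(g)))
```

Entries of the gradient that are exactly zero would give 0/0 when `eps` is zero, which is why the example uses a gradient without zero entries.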

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2409.20325

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


Faster Acceleration for Steepest Descent

Bai, Site, Bullins, Brian

arXiv.org Machine Learning, Sep-27-2024

We propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. In contrast to standard acceleration techniques, our approach uses primal-dual iterate sequences taken with respect to differing norms, which are then coupled using an implicitly determined interpolation parameter. For $\ell_p$ norm smooth problems in $d$ dimensions, our method provides an iteration complexity improvement of up to $O(d^{1-\frac{2}{p}})$ in terms of calls to a first-order oracle, thereby allowing us to circumvent long-standing barriers in accelerated non-Euclidean steepest descent.

convex optimization, nesterov, optimization, (12 more...)

arXiv.org Machine Learning

2409.192

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)


Is All Learning (Natural) Gradient Descent?

Shoji, Lucas, Suzuki, Kenta, Kozachkov, Leo

arXiv.org Artificial Intelligence, Sep-24-2024

This paper shows that a wide class of effective learning rules -- those that improve a scalar performance measure over a given time window -- can be rewritten as natural gradient descent with respect to a suitably defined loss function and metric. Specifically, we show that parameter updates within this class of learning rules can be expressed as the product of a symmetric positive definite matrix (i.e., a metric) and the negative gradient of a loss function. We also demonstrate that these metrics have a canonical form and identify several optimal ones, including the metric that achieves the minimum possible condition number. The proofs of the main results are straightforward, relying only on elementary linear algebra and calculus, and are applicable to continuous-time, discrete-time, stochastic, and higher-order learning rules, as well as loss functions that explicitly depend on time.
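
The statement that an update can be written as a symmetric positive definite matrix times the negative gradient can be illustrated with elementary linear algebra. The sketch below uses one simple construction, which is an assumption for illustration and not necessarily the canonical form derived in the paper. Given an update `u` and gradient `g` with `u @ (-g) > 0` (the update decreases the loss to first order), it builds a symmetric positive definite `M` with `M @ (-g) == u`. The function name and the sample vectors are hypothetical.

```python
import numpy as np

def metric_from_update(update, grad):
    """Build a symmetric positive definite M such that M @ (-grad) equals update."""
    y = -grad
    s = update @ y
    if s <= 0:
        raise ValueError("update does not decrease the loss to first order")
    n = len(y)
    # Rank-one term maps y to update; the projector acts only on directions
    # orthogonal to y, so it leaves M @ y unchanged while making M full rank.
    proj = np.eye(n) - np.outer(y, y) / (y @ y)
    return np.outer(update, update) / s + proj

grad = np.array([1.0, -2.0, 0.5])        # hypothetical gradient
update = np.array([-0.5, 1.0, 0.3])      # hypothetical descent-aligned update
M = metric_from_update(update, grad)
print(np.allclose(M @ (-grad), update), np.all(np.linalg.eigvalsh(M) > 0))
```

Positive definiteness follows because the quadratic form of `M` is a sum of two non-negative terms that can vanish together only at the zero vector.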

descent, gradient, loss function, (15 more...)

arXiv.org Artificial Intelligence

2409.16422

Country: